Lesson 5 - Crawl and Scrape

Making the request

Using the 'requests' module

Use the requests module to make an HTTP request to http://www.tripadvisor.com

  • Check the status of the request
  • Display the response header information
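One possible way to approach this exercise is sketched below. The helper names fetch and describe_response and the 10-second timeout are illustrative choices, not part of the lesson, and the network call is left commented out so the cell can be read without a connection.

```python
import requests


def fetch(url, timeout=10):
    """Send a GET request; the 10-second timeout is an arbitrary choice."""
    return requests.get(url, timeout=timeout)


def describe_response(response):
    """Return the status code and the header lines of a requests.Response."""
    lines = [f"Status: {response.status_code}"]
    for name, value in response.headers.items():
        lines.append(f"{name}: {value}")
    return lines


# Usage (needs network access):
# response = fetch("http://www.tripadvisor.com")
# print("\n".join(describe_response(response)))
```

A status code of 200 means the request succeeded; response.headers behaves like a case-insensitive dictionary.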

In [ ]:

Get the '/robots.txt' file contents
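As a rough sketch, the robots.txt address can be derived from any page URL and its rules checked with the standard library's urllib.robotparser. The rules in SAMPLE_ROBOTS are made up for illustration and are not TripAdvisor's actual rules; the helper names are also hypothetical.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def robots_url(page_url):
    """Build the robots.txt URL for the site that hosts page_url."""
    return urljoin(page_url, "/robots.txt")


def parse_robots(text):
    """Parse the text of a robots.txt file into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Hypothetical rules, used only to show how the parser answers questions.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

rules = parse_robots(SAMPLE_ROBOTS)
print(robots_url("http://www.tripadvisor.com/Restaurants"))
print(rules.can_fetch("*", "http://example.com/private/page"))  # False
print(rules.can_fetch("*", "http://example.com/public/page"))   # True
```

To read the real file, the text could be fetched with requests.get(robots_url(...)).text and passed to parse_robots.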


In [ ]:

Get the HTML content from the website


In [ ]:

Scraping websites

Sometimes, you may want a little bit of information - a movie rating, stock price, or product availability - but the information is available only in HTML pages, surrounded by ads and extraneous content.

To get at this information, we build an automated web fetcher called a crawler or spider. After the HTML contents have been retrieved from the remote web servers, a scraper parses them to find the needle in the haystack.

BeautifulSoup Module

The bs4 module can be used to search a webpage (HTML file) and pull the required data from it. It does three things to make an HTML page searchable:

  • First, it converts the HTML page to Unicode and converts HTML entities to Unicode characters
  • Second, it parses (analyses) the HTML page using the best available parser. It will use an HTML parser unless you specifically tell it to use an XML parser
  • Finally, it transforms a complex HTML document into a complex tree of Python objects

This module takes the HTML page and creates four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.

  • The BeautifulSoup object represents the webpage as a whole
  • A Tag object corresponds to an XML or HTML tag in the webpage
  • A NavigableString object contains the text within a tag
  • A Comment object is a special type of NavigableString that holds an HTML comment

Read more about BeautifulSoup : https://www.crummy.com/software/BeautifulSoup/bs4/doc/
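The object types described above can be seen on a small document. This is a minimal sketch: the HTML string is a made-up example, and "html.parser" (Python's built-in parser) is chosen here only so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# A made-up heading that contains a comment and a piece of text.
html = '<h1 id="HEADING" class="heading_name"><!-- restaurant name -->Le Jardin Napolitain</h1>'

soup = BeautifulSoup(html, "html.parser")
heading = soup.h1

print(type(soup).__name__)     # the document as a whole
print(type(heading).__name__)  # the <h1> tag
print(heading["id"])           # attributes are read like dictionary keys

# The children of the tag: first the comment, then the text.
for node in heading.contents:
    print(type(node).__name__, repr(str(node)))
```

Because Comment is a subclass of NavigableString, check for Comment first when you need to tell the two apart.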


In [ ]:
<h1 id="HEADING" property="name" class="heading_name   ">
    <div class="heading_height"></div>
     "
     Le Jardin Napolitain
     "
</h1>

Step 1: Making the soup

First, we use the BeautifulSoup module to parse the HTML data into a tree of Python objects whose text is Unicode.

Let us write the code to parse an HTML page. We will use the TripAdvisor URL for an infamous restaurant - https://www.tripadvisor.com/Restaurant_Review-g187147-d1751525-Reviews-Cafe_Le_Dome-Paris_Ile_de_France.html
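A minimal sketch of this step follows. The function names make_soup and fetch_soup, the timeout, and the choice of "html.parser" are assumptions made for illustration; the live request is shown only as a comment, and a small made-up page is parsed instead so the cell runs offline.

```python
import requests
from bs4 import BeautifulSoup

URL = ("https://www.tripadvisor.com/Restaurant_Review-g187147-d1751525-"
       "Reviews-Cafe_Le_Dome-Paris_Ile_de_France.html")


def make_soup(html):
    """Parse an HTML string with Python's built-in parser."""
    return BeautifulSoup(html, "html.parser")


def fetch_soup(url, timeout=10):
    """Download a page and parse it; raises an error on a 4xx/5xx status."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return make_soup(response.text)


# Offline demonstration on a made-up page.
sample = ("<html><head><title>Cafe Le Dome</title></head>"
          "<body><p>Bonjour &amp; welcome</p></body></html>")
soup = make_soup(sample)
print(soup.title.string)  # Cafe Le Dome
print(soup.p.get_text())  # Bonjour & welcome

# With network access: soup = fetch_soup(URL)
```

Note how the HTML entity &amp; comes back as a plain "&" character after parsing.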


In [ ]:

Step 2: Inspect the element you want to scrape

In this step we will inspect the HTML data of the website to understand which tags and attributes match the element we want. Let us inspect the HTML data of the URL and find out where (under which tag) the review data is located.


In [ ]:
<div class="entry">
    <p class="partial_entry">
    Popped in on way to Eiffel Tower for lunch, big mistake. 
    Pizza was disgusting and service was poor. 
    Its a shame Trip Advisor dont let you score venues zero....
    <span class="taLnk ulBlueLinks" onclick="widgetEvCall('handlers.clickExpand',event,this);">More
    </span>
    </p>
</div>

Step 3: Searching the soup for the data

Beautiful Soup defines a lot of methods for searching the parse tree (soup). The two most popular methods are find(), which returns the first match, and find_all(), which returns a list of all matches.

The simplest filter is a tag name. Pass a tag name to a search method and Beautiful Soup will perform a match against that exact string.

Let us try and find all the <p> (paragraph) tags in the soup:
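The sketch below runs both kinds of search on a shortened copy of the review markup from Step 2, plus one unrelated paragraph added for contrast. The helper review_texts is a hypothetical name; it removes the "More" link before reading the text, which modifies the soup in place.

```python
from bs4 import BeautifulSoup

html = """
<div class="entry">
    <p class="partial_entry">
    Pizza was disgusting and service was poor.
    <span class="taLnk ulBlueLinks">More
    </span>
    </p>
</div>
<p>Some other paragraph</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Filter by tag name only: every <p> in the document.
paragraphs = soup.find_all("p")
print(len(paragraphs))  # 2


def review_texts(soup):
    """Return the text of each <p class="partial_entry">, without the 'More' link."""
    texts = []
    for p in soup.find_all("p", class_="partial_entry"):
        for span in p.find_all("span"):
            span.decompose()  # drop the expand link from the tree
        texts.append(" ".join(p.get_text().split()))  # collapse whitespace
    return texts


texts = review_texts(soup)
print(texts)
```

The keyword is spelled class_ because class is a reserved word in Python.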


In [ ]:

Step 4: Enable pagination

Automatically access subsequent pages
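One generic way to paginate is to follow the "next" link on each page until there is none. The sketch below assumes the link is an <a> tag with class "next"; that class name, the example.com pages, and the function names are all hypothetical, so inspect the real page to find the actual selector.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def next_page_url(soup, current_url):
    """Return the absolute URL of the next page, or None on the last page.

    Assumes the link is <a class="next" href="...">, which is a guess.
    """
    link = soup.find("a", class_="next")
    if link is None or not link.get("href"):
        return None
    return urljoin(current_url, link["href"])


def crawl(start_url, fetch_html, max_pages=5):
    """Yield (url, soup) pairs, following 'next' links up to max_pages."""
    url = start_url
    for _ in range(max_pages):
        if url is None:
            break
        soup = BeautifulSoup(fetch_html(url), "html.parser")
        yield url, soup
        url = next_page_url(soup, url)


# Made-up pages standing in for the website, so the sketch runs offline.
PAGES = {
    "http://example.com/reviews":
        '<p class="partial_entry">Good</p><a class="next" href="/reviews-or10">Next</a>',
    "http://example.com/reviews-or10":
        '<p class="partial_entry">Bad</p>',
}

visited = [url for url, _ in crawl("http://example.com/reviews", PAGES.__getitem__)]
print(visited)
```

For a live site, fetch_html could be something like lambda u: requests.get(u, timeout=10).text, ideally with a short pause between requests. The max_pages limit keeps a mistake in the selector from turning into an endless crawl.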


In [ ]:

Using yesterday's sentiment analysis code and the sentiment corpus in the word_sentiment.csv file, calculate the sentiment of the reviews.
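Yesterday's code and the layout of word_sentiment.csv are not shown in this lesson, so the following is only a sketch under stated assumptions: each row of the file is taken to be "word,score", and a review's sentiment is taken to be the average score of the words found in the corpus. The sample rows are invented.

```python
import csv
import io


def load_sentiment(file_obj):
    """Read 'word,score' rows into a dictionary (assumed file layout)."""
    scores = {}
    for row in csv.reader(file_obj):
        if len(row) < 2:
            continue
        try:
            scores[row[0].strip().lower()] = float(row[1])
        except ValueError:
            continue  # skip a header row or a malformed score
    return scores


def review_sentiment(text, scores):
    """Average the scores of the known words; 0.0 if none are known."""
    words = [w.strip(".,!?;:'\"()").lower() for w in text.split()]
    known = [scores[w] for w in words if w in scores]
    return sum(known) / len(known) if known else 0.0


# Invented rows standing in for word_sentiment.csv.
sample_csv = io.StringIO("disgusting,-3\npoor,-2\ngreat,3\n")
scores = load_sentiment(sample_csv)
print(review_sentiment("Pizza was disgusting and service was poor.", scores))  # -2.5
```

With the real file, replace the StringIO object with open("word_sentiment.csv", newline="").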


In [ ]:
#Enter your code here

Expanding this further

To add more detail, we can inspect the tags further and extract the reviewer rating and reviewer details.
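The sketch below shows the general shape of such an extraction. The markup is invented: the class names review-container, info_text, and ui_bubble_rating, and the convention that bubble_10 means a rating of 1.0, are assumptions to be checked against the real page in the browser's developer tools.

```python
from bs4 import BeautifulSoup

# Invented markup; verify the real tags and classes before relying on it.
html = """
<div class="review-container">
  <div class="info_text"><div>traveller42</div></div>
  <span class="ui_bubble_rating bubble_10"></span>
  <p class="partial_entry">Pizza was disgusting.</p>
</div>
"""


def parse_rating(tag):
    """Turn a class such as 'bubble_10' into 1.0 (assumed convention)."""
    for cls in tag.get("class", []):
        if cls.startswith("bubble_"):
            return int(cls.split("_")[1]) / 10
    return None


def extract_reviews(soup):
    """Collect reviewer, rating, and text for each review container."""
    reviews = []
    for box in soup.find_all("div", class_="review-container"):
        reviewer = box.find("div", class_="info_text")
        rating = box.find("span", class_="ui_bubble_rating")
        text = box.find("p", class_="partial_entry")
        reviews.append({
            "reviewer": reviewer.get_text(strip=True) if reviewer else None,
            "rating": parse_rating(rating) if rating else None,
            "text": text.get_text(strip=True) if text else None,
        })
    return reviews


reviews = extract_reviews(BeautifulSoup(html, "html.parser"))
print(reviews)
```

Each field falls back to None when its tag is missing, so one incomplete review does not stop the whole scrape.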


In [ ]:

Using the review data and the ratings available, is there any way we can improve the sentiment corpus in the "word_sentiment.csv" file?


In [ ]:

Dynamic Pages

Some websites make requests in the background to fetch data from the server and load it into the page dynamically (often an AJAX request). In this case, the page URL will not indicate the location of the data. To find such requests, open the Chrome or Firefox Developer Tools, load the page, go to the “Network” tab, and look through the requests being sent in the background to find the one that returns the data you’re looking for. Start by filtering the requests to only XHR or JS to make this easier.

Once you find the AJAX request that returns the data you’re hoping to scrape, you can make your scraper send requests to this URL instead of to the parent page’s URL. If you’re lucky, the response will be encoded as JSON, which is even easier to parse than HTML.
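When the background request returns JSON, the response can be turned into Python dictionaries and lists directly. This is a sketch only: the function names, the timeout, and the shape of the sample payload (a "restaurants" list with "name" and "rating" keys) are invented for illustration, and a real endpoint will have its own structure.

```python
import json

import requests


def fetch_json(url, params=None, timeout=10):
    """Request a URL found in the Network tab and decode its JSON body."""
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()
    return response.json()


def restaurant_names(payload):
    """Pull the names out of a payload shaped like the sample below."""
    return [item["name"] for item in payload.get("restaurants", [])]


# Invented payload standing in for a real AJAX response.
sample = json.loads(
    '{"restaurants": ['
    '{"name": "Example Bistro", "rating": 4.5}, '
    '{"name": "Sample Cafe", "rating": 3.0}]}'
)
print(restaurant_names(sample))
```

No HTML parsing is needed here, because response.json() already returns ordinary Python objects.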


In [ ]:

Spoofing the User Agent

By default, the requests library sets the User-Agent header on each request to something like “python-requests/2.x.x”, where the number is the installed version of the library. You can change it to identify your web scraper, perhaps providing a contact email address so that an admin from the target website can reach out if they see you in their logs.

More commonly, this can be used to make it appear that the request is coming from a normal web browser, and not a web scraping program.
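A small sketch of both ideas: printing the default User-Agent, and preparing a request with a custom one without sending it. The scraper name and the contact address are placeholders, and build_request is a hypothetical helper.

```python
import requests

# Placeholder identity; replace with your own scraper name and contact.
CUSTOM_HEADERS = {
    "User-Agent": "lesson5-scraper/0.1 (contact: scraper-admin@example.com)",
}


def build_request(url, headers):
    """Prepare a GET request with the given headers, without sending it."""
    return requests.Request("GET", url, headers=headers).prepare()


# What requests would send if no User-Agent were supplied.
print(requests.utils.default_user_agent())

# What will be sent with the custom header.
prepared = build_request("http://www.tripadvisor.com", CUSTOM_HEADERS)
print(prepared.headers["User-Agent"])
```

To send the request for real, pass the same dictionary as requests.get(url, headers=CUSTOM_HEADERS).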


In [ ]:
import requests

# Headers copied from a browser session: the cookie carries the session state,
# and the user-agent makes the request look like it comes from Chrome.
header = {
    'cookie': (
        'TAUnique=%1%enc%3AHvAwOscAcmfzIwJbsS10GnXn4FrCUpCm%2Bnw21XKuzXoV7vSwMEnyTA%3D%3D; fbm_162729813767876=base_domain=.tripadvisor.com; TACds=B.3.11419.1.2019-03-31; TASSK=enc%3AABCGM1r6xBekOjRaaQZ3QVS7dP4cwZ8sombvPTq8xK6xN55i7TN8puwZdwvXvG1i%2FJ2UQXYG1CwsU%2BXLwLs5qIxnmW5qbLt4I48DfK5FhHpwUw3ZgrbskK%2FjDc4ENfcCXw%3D%3D; ServerPool=C; TART=%1%enc%3A8yMCW7EtdBqPX0oluvfOS5mBk6DRMHXwNEAPJlcpaDumiCWsxs%2BxfBbTYsxpa%2F9l%2FJzCllshf9g%3D; VRMCID=%1%V1*id.10568*llp.%2FRestaurant_Review-g187147-d3405673-Reviews-La_Terrasse_Vedettes_de_Paris-Paris_Ile_de_France%5C.html*e.1557691551614; PMC=V2*MS.36*MD.20190505*LD.20190506; PAC=ALNtqHPT2KJjQwExTPJt3gCvzvDYH_x63ZOT4b3LetvkHuHXcEUY4eLx0TqKGzOIpoXF3K_j57rNigUkWJzSv7TtTna4L3DKcfiaeK9zT9ixGEevH6QwZVd-PdMyr9y5aRzjEVAfid42zC4WXeTcQTJkPVwGMCW2mB2k3xxfB78GgJFIR_I9vf6Bzhq89x_UTTUcQgFpCr8GEFV9GpJWG8UNGeriJSbmPtCXA10oXl5ox7U9TQvSILLSH8PdrP8nwUQMRnfUA_fKbXTaRgH4tzBwZQpbd1vlOOg7fKyfIN9V95PzNOXBEQCJIo3z09Nux0tyZZVX0PX_zI_moLpr9Od3eSi1E8Hm5QcLyG9QNfA1C5WckG9GOV5VKEL0bxDY5TG1smCaQDXpRLkvp8w2bD7vyI2e27WFbtuYvJDJ126v2_KyZmVbG3laZlvWrX2kWGL13IyhVS2Ivjr_9uJAwMpBKuNByH0FBU3ziJcRdqkXiz6lnYMSRSQ1Y8Dmkjkrc0DNTABvuHjbZ7Fh0LOINswW_wrkVsP4PjDq1IVh7IY0hLE_W1G1DKlROc5BZEOjcw%3D%3D; BEPIN=%1%16a8c46770b%3Bbak92b.b.tripadvisor.com%3A10023%3B; TATravelInfo=V2*A.2*MG.-1*HP.2*FL.3*DSM.1557131589173*RS.1*RY.2019*RM.5*RD.6*RH.20*RG.2; '
        'CM=%1%RestAds%2FRPers%2C%2C-1%7CRCPers%2C%2C-1%7Csesstch15%2C%2C-1%7CCYLPUSess%2C%2C-1%7Ctvsess%2C%2C-1%7CPremiumMCSess%2C%2C-1%7CRestPartSess%2C%2C-1%7CUVOwnersSess%2C%2C-1%7CRestPremRSess%2C%2C-1%7CPremRetPers%2C%2C-1%7CViatorMCPers%2C%2C-1%7Csesssticker%2C%2C-1%7C%24%2C%2C-1%7Ct4b-sc%2C%2C-1%7CMC_IB_UPSELL_IB_LOGOS2%2C%2C-1%7CPremMCBtmSess%2C%2C-1%7CLaFourchette+Banners%2C%2C-1%7Csesshours%2C%2C-1%7CTARSWBPers%2C%2C-1%7CTheForkORSess%2C%2C-1%7CTheForkRRSess%2C%2C-1%7CRestAds%2FRSess%2C%2C-1%7CPremiumMobPers%2C%2C-1%7CLaFourchette+MC+Banners%2C%2C-1%7Csesslaf%2C%2C-1%7CRestPartPers%2C%2C-1%7CCYLPUPers%2C%2C-1%7CCCUVOwnSess%2C%2C-1%7Cperslaf%2C%2C-1%7CUVOwnersPers%2C%2C-1%7Csh%2C%2C-1%7CTheForkMCCSess%2C%2C-1%7CCCPers%2C%2C-1%7Cb2bmcsess%2C%2C-1%7CSPMCPers%2C%2C-1%7Cperswifi%2C%2C-1%7CPremRetSess%2C%2C-1%7CViatorMCSess%2C%2C-1%7CPremiumMCPers%2C%2C-1%7CPremiumRRPers%2C%2C-1%7CRestAdsCCPers%2C%2C-1%7CTrayssess%2C%2C-1%7CPremiumORPers%2C%2C-1%7CSPORPers%2C%2C-1%7Cperssticker%2C%2C-1%7Cbooksticks%2C%2C-1%7CSPMCWBSess%2C%2C-1%7Cbookstickp%2C%2C-1%7CPremiumMobSess%2C%2C-1%7Csesswifi%2C%2C-1%7Ct4b-pc%2C%2C-1%7CWShadeSeen%2C%2C-1%7CTheForkMCCPers%2C%2C-1%7CHomeASess%2C9%2C-1%7CPremiumSURPers%2C%2C-1%7CCCUVOwnPers%2C%2C-1%7CTBPers%2C%2C-1%7Cperstch15%2C%2C-1%7CCCSess%2C2%2C-1%7CCYLSess%2C%2C-1%7Cpershours%2C%2C-1%7CPremiumORSess%2C%2C-1%7CRestAdsPers%2C%2C-1%7Cb2bmcpers%2C%2C-1%7CTrayspers%2C%2C-1%7CPremiumSURSess%2C%2C-1%7CMC_IB_UPSELL_IB_LOGOS%2C%2C-1%7Csess_rev%2C%2C-1%7Csessamex%2C%2C-1%7CPremiumRRSess%2C%2C-1%7CTADORSess%2C%2C-1%7CAdsRetPers%2C%2C-1%7CMCPPers%2C%2C-1%7CSPMCSess%2C%2C-1%7Cpers_rev%2C%2C-1%7Cmdpers%2C%2C-1%7Cmds%2C1557131565748%2C1557217965%7CSPMCWBPers%2C%2C-1%7CRBAPers%2C%2C-1%7CHomeAPers%2C%2C-1%7CRCSess%2C%2C-1%7CRestAdsCCSess%2C%2C-1%7CRestPremRPers%2C%2C-1%7Cpssamex%2C%2C-1%7CCYLPers%2C%2C-1%7Ctvpers%2C%2C-1%7CTBSess%2C%2C-1%7CAdsRetSess%2C%2C-1%7CMCPSess%2C%2C-1%7CTADORPers%2C%2C-1%7CTheForkORPers%2C%2C-1%7CPremMCBtmPers%2C%2C-1%7CTheForkRRPers%2'
        'C%2C-1%7CTARSWBSess%2C%2C-1%7CRestAdsSess%2C%2C-1%7CRBASess%2C%2C-1%7Cmdsess%2C%2C-1%7C; fbsr_162729813767876=wtGNSIucBSm5EusyRkPyX_GfZwxNkyHLxTRli46iHoM.eyJjb2RlIjoiQVFBUHV3SlZpOVNXQXVkMDh1bUdaYjZ2R3hBMkdfdFBZdm9Bb2l2cDEzSDNvaG1ESjRkamo1V1A3dnB5WloxWmwzeWxFTmdCT0dCbTB6dzc1S2pwUHFKak5nQVNKMGNqOEtvUVY1YzZXNHhNQ1FlMURNNXJOUUpMeEJldjlBS2xKNnhVVjVXQ1ZaajZjN1k4X1ZWeGdxbzlIclhKT3BvUDZSLTVzNkVUZ3Q5Q0xMNmg0ZnZIY0pMSm1KdXJwN0lGVFBSOUdvX0Z4M0FiM0VWQ1RnVFNGNzc2NFFuU29fdER5VFk3TWY0V0VKSFZXZi11ME1pa2ZWS1ZzUHdHQlBOOE1xZkVQNjZfZHpZMVdnSEVfcWR4d2FHN2xNODNyR1BWaDVwdDdodlFQQmFBbGtzU21IYjZiSktEaGVGajM4WTg3TGxUUF9hNEVGUjVjOVdoOVNhY2RmV04iLCJ1c2VyX2lkIjoiMTY1NjQ2NDcxNSIsImFsZ29yaXRobSI6IkhNQUMtU0hBMjU2IiwiaXNzdWVkX2F0IjoxNTU3MTMzMTgxfQ; TAReturnTo=%1%%2FRestaurants-g304551-New_Delhi_National_Capital_Territory_of_Delhi.html; roybatty=TNI1625!APyGsDM6tcKypRo49myenvbO5Zyk367lJP3JEhTSBrfno%2F4Bbienyfvs6Q2DU%2F2UmkzjN1pKquiSNGeY2cXQm8s8oX1jKwXT8hgK3GL%2B6psZHdp4k7TF4F52uoI2kQ1e9Ni2k9Ub8D5ak%2FXgN%2F9as9m2HZIB0G6SZnZMT%2FPD73Fo%2C1; SRT=%1%enc%3A8yMCW7EtdBqPX0oluvfOS5mBk6DRMHXwNEAPJlcpaDumiCWsxs%2BxfBbTYsxpa%2F9l%2FJzCllshf9g%3D; TASession=V2ID.2C4059CFCBC27797DA97994A5CF94A28*SQ.233*LS.PageMoniker*GR.7*TCPAR.44*TBR.80*EXEX.60*ABTR.87*PHTB.57*FS.2*CPU.54*HS.recommended*ES.popularity*DS.5*SAS.popularity*FPS.oldFirst*LF.en*FA.1*DF.0*IR.4*TRA.false*LD.304551; TAUD=LA-1557055610999-1*RDD-1-2019_05_05*RD-75954750-2019_05_06.9784431*HDD-75978369-2019_05_19.2019_05_20.1*HC-76743574*LG-77588176-2.1.F.*LD-77588177-.....'
    ),
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'
}

# POST to the restaurant search URL with the browser-like headers attached.
response = requests.post("https://www.tripadvisor.com/RestaurantSearch?Action=PAGE&geo=304551&ajax=1&itags=10591&sortOrder=relevance&o=a30&availSearchEnabled=false", headers=header)